FDB: A Query Engine for Factorised Relational Databases 



Nurzhan Bakibayev, Dan Olteanu, and Jakub Zavodny 
Department of Computer Science, University of Oxford, OX1 3QD, UK

{nurzhan.bakibayev, dan.olteanu, jakub.zavodny}@cs.ox.ac.uk






ABSTRACT 

Factorised databases are relational databases that use com- 
pact factorised representations at the physical layer to re- 
duce data redundancy and boost query performance. 

This paper introduces FDB, an in-memory query engine 
for select-project-join queries on factorised databases. Key 
components of FDB are novel algorithms for query optimi- 
sation and evaluation that exploit the succinctness brought 
by data factorisation. Experiments show that for data sets 
with many-to-many relationships FDB can outperform rela- 
tional engines by orders of magnitude. 

1. INTRODUCTION 

This paper introduces FDB, an in-memory query engine 
for select-project-join queries on factorised relational data. 

At the outset of this work lies the observation that re- 
lations can admit compact, factorised representations that 
can effectively boost the performance of relational process- 
ing. The relationship between relations and their factorised 
representations is on a par with the relationship between 
logic functions in disjunctive normal form and their equiva- 
lent nested forms obtained by algebraic factorisation. 

Example 1. Consider a database of a grocery retailer containing
delivery orders, stock availability at different locations,
availability of dispatcher units for the individual locations,
and grocery producers with the items they produce and the
locations they supply to (Figure 1). A query $Q_1$ that finds all
orders with their respective items, possible locations to
retrieve them from, and dispatchers available to deliver them,
returns the following result (shown only partially):

$Q_1 = \text{Order} \bowtie_{\text{item}} \text{Store} \bowtie_{\text{location}} \text{Disp}$

oid   item   location   dispatcher
01    Milk   Istanbul   Adnan
01    Milk   Istanbul   Yasemin
01    Milk   Izmir      Adnan
01    Milk   Antalya    Volkan
...



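This query result can be expressed as a relational expression
built using singleton relations, union, and product, whereby
each singleton relation $\langle v \rangle$ holds one value $v$, each
tuple is a product of singleton relations, and an arbitrary
relation is a union of products of singleton relations:

$$
\begin{aligned}
&\langle 01 \rangle \times \langle \text{Milk} \rangle \times \langle \text{Istanbul} \rangle \times \langle \text{Adnan} \rangle \;\cup \\
&\langle 01 \rangle \times \langle \text{Milk} \rangle \times \langle \text{Istanbul} \rangle \times \langle \text{Yasemin} \rangle \;\cup \\
&\langle 01 \rangle \times \langle \text{Milk} \rangle \times \langle \text{Izmir} \rangle \times \langle \text{Adnan} \rangle \;\cup \\
&\langle 01 \rangle \times \langle \text{Milk} \rangle \times \langle \text{Antalya} \rangle \times \langle \text{Volkan} \rangle \;\cup\; \ldots
\end{aligned}
$$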

A more compact equivalent representation can be obtained 
by algebraic factorisation using distributivity of product over 
union and commutativity of product and union: 

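$$
\begin{aligned}
&\langle \text{Milk} \rangle \times \langle 01 \rangle \times \big( \langle \text{Istanbul} \rangle \times (\langle \text{Adnan} \rangle \cup \langle \text{Yasemin} \rangle) \;\cup \\
&\qquad \langle \text{Izmir} \rangle \times \langle \text{Adnan} \rangle \;\cup\; \langle \text{Antalya} \rangle \times \langle \text{Volkan} \rangle \big) \;\cup \\
&\langle \text{Cheese} \rangle \times (\langle 01 \rangle \cup \langle 03 \rangle) \times \big( \langle \text{Istanbul} \rangle \times (\langle \text{Adnan} \rangle \cup \langle \text{Yasemin} \rangle) \;\cup \\
&\qquad \langle \text{Antalya} \rangle \times \langle \text{Volkan} \rangle \big) \;\cup \\
&\langle \text{Melon} \rangle \times (\langle 02 \rangle \cup \langle 03 \rangle) \times \langle \text{Istanbul} \rangle \times (\langle \text{Adnan} \rangle \cup \langle \text{Yasemin} \rangle)
\end{aligned}
$$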

This factorised representation has the following structure: 
for each item, we construct a union of its possible orders 
and a union of its possible locations with dispatchers. This 
nesting structure together with the attribute names form 
the schema of the factorised representation, which we call a 
factorisation tree, or f-tree for short. 

Figure 2 depicts several f-trees; the leftmost one ($\mathcal{T}_1$)
captures the nesting structure of the above factorisation. The
second f-tree ($\mathcal{T}_2$) is an alternative nesting structure for
the same query result, where for each location, we construct a
union of its items and orders and a union of dispatchers:

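$$
\begin{aligned}
&\langle \text{Istanbul} \rangle \times \big( \langle \text{Milk} \rangle \times \langle 01 \rangle \;\cup\; \langle \text{Cheese} \rangle \times (\langle 01 \rangle \cup \langle 03 \rangle) \;\cup \\
&\qquad \langle \text{Melon} \rangle \times (\langle 02 \rangle \cup \langle 03 \rangle) \big) \times (\langle \text{Adnan} \rangle \cup \langle \text{Yasemin} \rangle) \;\cup \\
&\langle \text{Izmir} \rangle \times \langle \text{Milk} \rangle \times \langle 01 \rangle \times \langle \text{Adnan} \rangle \;\cup \\
&\langle \text{Antalya} \rangle \times \big( \langle \text{Milk} \rangle \times \langle 01 \rangle \;\cup\; \langle \text{Cheese} \rangle \times (\langle 01 \rangle \cup \langle 03 \rangle) \big) \times \langle \text{Volkan} \rangle
\end{aligned}
$$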

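The factorised result of the query $Q_2 = \text{Produce} \bowtie_{\text{supplier}} \text{Serve}$
over the f-tree $\mathcal{T}_3$ given in Figure 2 is:

$$
\begin{aligned}
&\langle \text{Guney} \rangle \times (\langle \text{Milk} \rangle \cup \langle \text{Cheese} \rangle) \times \langle \text{Antalya} \rangle \;\cup \\
&\langle \text{Dikici} \rangle \times \langle \text{Milk} \rangle \times (\langle \text{Istanbul} \rangle \cup \langle \text{Izmir} \rangle \cup \langle \text{Antalya} \rangle) \;\cup \\
&\langle \text{Byzantium} \rangle \times \langle \text{Melon} \rangle \times \langle \text{Istanbul} \rangle
\end{aligned}
$$
□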

Figure 1: An example database for a grocery retailer.

Figure 2: Factorisation trees used in Example 1. From left to right: $\mathcal{T}_1$ and $\mathcal{T}_2$ for the result of query $Q_1$; $\mathcal{T}_3$
and $\mathcal{T}_4$ for the result of $Q_2$; $\mathcal{T}_5$ is obtained after joining $\mathcal{T}_1$ and $\mathcal{T}_4$ on item, and $\mathcal{T}_6$ is $\mathcal{T}_5$ after joining on location.

Factorisations are ubiquitous. They are arguably best
known for the minimisation of Boolean functions [8] but can
be useful in a number of read-optimised database scenarios.
The scenario we consider in this paper is that of factorising
large intermediate and final results to speed up query
evaluation on data sets with many-to-many relationships. A
further scenario we envisage is that of compiled databases:
these are static databases, such as databases encoding the
human genome [TS], that can be aggressively factorised to
efficiently support a particular scientific workload. In
provenance and probabilistic databases, factorisations of
provenance polynomials [11] are used for compact encoding of
large provenance (the GeneOntology database has records
with 10MB provenance) and for efficient query evaluation
[16, 22]. Factorisations are a natural fit whenever we
deal with a large space of possibilities or choices. For in- 
stance, data models for design specifications, such as the 
AND/OR trees [ill, are based on incompleteness and non- 
determinism and are captured by factorised representations. 
Formalisms for incomplete information, such as world-set 
decompositions [H [l7], rely on factorisations of universal 
relations encoding very large sets of possible worlds; they 
are products of unions of products of tuples. Outside data 
management scenarios, factorised relations can be used to 
compactly represent the space of feasible solutions to config- 
uration problems in constraint satisfaction, where we need to 
connect a fixed finite set of given components so as to meet 
a given objective while respecting given constraints [S]. 

Factorised representations have several key properties that 
make them appealing in the above mentioned scenarios. 

They can be exponentially more succinct than the relations
they encode. For instance, a product of $n$ relations
needs size exponential in $n$ for a relational result, but only
linear in the size of the input relations for a factorised
result. Recent work has established tight bounds on the size
of factorised query results [19]: for any select-project-join
query $Q$, there is a rational number $s(Q)$ such that for any
database $D$, there exists a factorised representation $E$ of
$Q(D)$ with size $O(|D|^{s(Q)})$, and within the class of
representations whose structures are given by factorisation trees,
there is no factorisation of smaller size. The parameter $s(Q)$
is the fractional edge cover number of a particular subquery
of $Q$, and there are arbitrarily large queries $Q$ for which
$s(Q) = 1$. Moreover, the exponential gap between the sizes
of $E$ and of $Q(D)$ also holds between the times needed to
compute $E$ and $Q(D)$ directly from the input database $D$.
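The gap for products can be checked with simple arithmetic. The following is a minimal sketch, not part of FDB; the helper names `flat_size` and `factorised_size` are hypothetical, and size is measured as the number of stored values.

```python
def flat_size(relations):
    """Values stored when the product is materialised as flat tuples."""
    n_tuples = 1
    for r in relations:
        n_tuples *= len(r)
    arity = sum(len(r[0]) for r in relations)
    return n_tuples * arity


def factorised_size(relations):
    """Values stored when the result is kept as a product of the inputs."""
    return sum(len(r) * len(r[0]) for r in relations)


# Six unary relations with ten tuples each.
relations = [[(i,) for i in range(10)] for _ in range(6)]
print(flat_size(relations))        # 6000000
print(factorised_size(relations))  # 60
```

The flat result grows as $10^n$ with the number $n$ of relations, whereas the factorised result grows as $10 \cdot n$.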

Further succinctness can be achieved using dictionary- 
based compression and null suppression of data values [5D] . 
Compressing entire vertical partitions of relations as done in
C-Store [9] is not compatible with our factorisation approach
since it breaks the relational structure.

Notwithstanding succinctness, factorised representations
of query results allow for fast (constant-delay) enumeration
of tuples. More succinct representations are definitely
possible, e.g., binary join decompositions [10] or just the pair
of the query and the database [6], but then retrieving any
tuple in the query result is already NP-hard. Factorised
representations can thus be seen as compilations of query
results that allow for efficient subsequent processing.

By construction, factorised representations reduce redundancy
in the data and boost query performance using a mixture
of vertical (product) and horizontal (union) data partitioning.
This goal is shared with a large body of work on
normal forms [2] and columnar stores [7] that considers join
(or general vertical) decompositions, and with partitioning-based
automated physical database design [3, 13]. In the
latter case, the focus is on partitioning input data such that
the performance of a particular workload is maximised.

Finally, factorised representations are relational algebra 
expressions with well-understood semantics. Their relational 
nature sets them apart from XML documents, object-oriented 
databases, and nested objects [2], where the goal is to avoid 
the rigidity of the relational model. Moreover, in our set- 
ting, a query result can admit several equivalent factorised 
representations and the goal is to find one of small size. 
The Verso project [1] points out compactness and modelling
benefits of non-first-normal-form relations and considers hi- 
erarchical data representations that are special cases of fac- 
torised representations. It does not focus on factorisations 
and thus neither on the search for ones of small sizes. 

A factorised database presents relations at the logical layer 
but uses succinct factorised representations at the physical 
layer. The FDB query engine can thus not only compute 
factorised query results for input relational databases, but 
can evaluate queries directly on input factorised databases. 

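Example 2. Consider now the query $Q_1 \bowtie_{\text{location},\text{item}} Q_2$
on factorised representations: Find possible suppliers of
ordered items. Joining the above factorisations over the f-trees
$\mathcal{T}_1$ and $\mathcal{T}_3$ on the attributes location and item is not
immediate, since tuples with equal values for location and item
appear scattered in the factorisation over $\mathcal{T}_3$. If we
restructure the factorisation of $Q_2$'s result to follow the f-tree
$\mathcal{T}_4$ so that tuples are grouped by item first, we obtain

$$
\begin{aligned}
&\langle \text{Milk} \rangle \times \big( \langle \text{Guney} \rangle \times \langle \text{Antalya} \rangle \;\cup \\
&\qquad \langle \text{Dikici} \rangle \times (\langle \text{Istanbul} \rangle \cup \langle \text{Izmir} \rangle \cup \langle \text{Antalya} \rangle) \big) \;\cup \\
&\langle \text{Cheese} \rangle \times \langle \text{Guney} \rangle \times \langle \text{Antalya} \rangle \;\cup \\
&\langle \text{Melon} \rangle \times \langle \text{Byzantium} \rangle \times \langle \text{Istanbul} \rangle,
\end{aligned}
$$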

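which can be readily joined with the factorisation over $\mathcal{T}_1$
on the attribute item, since both factorisations have items
as topmost values. The factorisation of the join on item
follows the f-tree $\mathcal{T}_5$, where we simply merged the roots of
the two f-trees. An excerpt of this factorisation is

$$
\begin{aligned}
&\langle \text{Milk} \rangle \times \langle 01 \rangle \times \big( \langle \text{Istanbul} \rangle \times (\langle \text{Adnan} \rangle \cup \langle \text{Yasemin} \rangle) \;\cup \\
&\qquad \langle \text{Izmir} \rangle \times \langle \text{Adnan} \rangle \;\cup\; \langle \text{Antalya} \rangle \times \langle \text{Volkan} \rangle \big) \\
&\quad \times \big( \langle \text{Guney} \rangle \times \langle \text{Antalya} \rangle \;\cup \\
&\qquad \langle \text{Dikici} \rangle \times (\langle \text{Istanbul} \rangle \cup \langle \text{Izmir} \rangle \cup \langle \text{Antalya} \rangle) \big) \;\cup\; \ldots
\end{aligned}
$$

To perform the second join condition on location, we first
need, for each item, to rearrange the subexpression for
suppliers and locations, so that it is grouped by locations as
opposed to suppliers. This amounts to swapping supplier
and location in $\mathcal{T}_5$. The join on location can now be
performed between the possible locations of each item. The
obtained factorisation follows the schema $\mathcal{T}_6$ in Figure 2. □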

Examples 1 and 2 highlight challenges involved in computing
factorised representations of query results.

Firstly, a query result may have different (albeit equiv- 
alent) factorised representations whose sizes can differ by 
an exponential factor. We seek f-trees that define succinct 
representations of query results for all input (relational or 
factorised) databases. Such f-trees can be statically derived 
from the query and the input schema, but are independent 
of the database content. Query optimisation thus has to 
consider two objectives: minimising the cost of computing a 
factorised query result from the (possibly factorised) input 
database, and minimising the size of this output representa- 
tion. In addition to the standard query operators selection, 
projection, and product, the search space for a good query 
and factorisation plan, or f-plan for short, needs to consider 
specific operators for restructuring schemas and factorisa- 
tions. We propose two such operators: a swap operator, 
which exchanges a given child with its parent in an f-tree, 
and a push-up operator, which moves an entire sub-tree up 
in the f-tree. For instance, the swap operator is used to
transform the f-tree $\mathcal{T}_3$ into $\mathcal{T}_4$ in Figure 2. The selection
operator is used to merge the item nodes in the f-trees $\mathcal{T}_1$ and
$\mathcal{T}_4$ and create the f-tree $\mathcal{T}_5$. The transformation of $\mathcal{T}_5$ into
$\mathcal{T}_6$, which corresponds to a join on location, needs a swap of
supplier and location and a merge of the two location nodes.

Secondly, we would like to compute the factorised query 
result as efficiently as possible. This means in particular 
that we must avoid the computation of intermediate results 
in relational, un-factorised form. Our query engine has al- 
gorithms for each operator selection, projection, product, 
swap, and push-up. These algorithms use time (quasi)linear 
in the sizes of input and output representations and ensure 
that the f-tree of the resulting factorisation is optimal with 
respect to tight size bounds that can be derived from the 
input f-tree and the operator. 

The main contributions of this paper are as follows: 

• We address new challenges to query optimisation in 
the presence of factorised data and restructuring op- 
erators. In addition to the cost of computing the fac- 
torised query result, we also need to consider the size 
of the resulting factorisation. 

• We give exhaustive and heuristic optimisation algorithms
for computing f-plans whose outcomes are factorised
query results. As cost metric, we use selectivity
and cardinality estimates and a parameter that defines
tight bounds on the sizes of the factorised result and
of the temporary results.



• We give algorithms for the evaluation of each f-plan 
operator on factorised data. They are optimal with 
respect to time complexity and to tight size bounds 
inferred from the input f-tree and the operator. 

• The optimisation and evaluation algorithms have been 
implemented in the FDB in-memory query engine. 

• We report on an extensive experimental evaluation 
showing that FDB can outperform a homebred in- 
memory and two open-source (SQLite and PostgreSQL) 
relational query engines by orders of magnitude. 

2. F-REPRESENTATIONS AND F-TREES 

We next recall the notions of factorised representations
and factorisation trees, as well as results on tight size bounds
for such factorised representations over factorisation trees [19].

Factorised representations of relations are algebraic ex- 
pressions constructed using singleton relations and the rela- 
tional operators union and product. 

Definition 1. A factorised representation $E$, or f-representation
for short, over a set $\mathcal{S}$ of attributes and domain $\mathcal{D}$ is
a relational algebra expression of the form

• $\emptyset$, the empty relation over schema $\mathcal{S}$;

• $\langle\rangle$, the relation consisting of the nullary tuple, if $\mathcal{S} = \emptyset$;

• $\langle A{:}a \rangle$, the unary relation with a single tuple with value
$a$, if $\mathcal{S} = \{A\}$ and $a$ is a value in the domain $\mathcal{D}$;

• $(E)$, where $E$ is an f-representation over $\mathcal{S}$;

• $E_1 \cup \cdots \cup E_n$, where each $E_i$ is an f-representation
over $\mathcal{S}$;

• $E_1 \times \cdots \times E_n$, where each $E_i$ is an f-representation
over $\mathcal{S}_i$ and $\mathcal{S}$ is the disjoint union of all $\mathcal{S}_i$.

An expression $\langle A{:}a \rangle$ is called an $A$-singleton and the
expression $\langle\rangle$ is called the nullary singleton. The size $|E|$ of an
f-representation $E$ is the number of singletons in $E$.

Any f-representation over a set $\mathcal{S}$ of attributes can be
interpreted as a database over schema $\mathcal{S}$. Example 1 gives several
f-representations, where singleton types are dropped for
compactness reasons. For instance, $\langle \text{Istanbul} \rangle \times (\langle \text{Adnan} \rangle \cup
\langle \text{Yasemin} \rangle)$ represents a relation with schema {location,
dispatcher} and tuples (Istanbul, Adnan), (Istanbul, Yasemin).
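Definition 1 maps directly onto a small recursive data type. The sketch below is an illustration only and not FDB's physical layout; the class names `Singleton`, `Union`, and `Product` and the helpers `size` and `tuples` are assumptions introduced here. The nullary singleton is modelled as an empty `Product` and is not counted by `size`, and the enumeration materialises sub-results rather than achieving constant delay.

```python
from dataclasses import dataclass
from itertools import product as cartesian
from typing import Tuple


@dataclass(frozen=True)
class Singleton:          # <A:a>
    attr: str
    value: object


@dataclass(frozen=True)
class Union:              # E_1 U ... U E_n
    parts: Tuple[object, ...]


@dataclass(frozen=True)
class Product:            # E_1 x ... x E_n
    parts: Tuple[object, ...]


def size(e):
    """Number of attribute singletons in the expression."""
    if isinstance(e, Singleton):
        return 1
    return sum(size(p) for p in e.parts)


def tuples(e):
    """Enumerate the represented relation as attribute -> value dicts."""
    if isinstance(e, Singleton):
        yield {e.attr: e.value}
    elif isinstance(e, Union):
        for p in e.parts:
            yield from tuples(p)
    else:
        for combo in cartesian(*(list(tuples(p)) for p in e.parts)):
            merged = {}
            for d in combo:
                merged.update(d)
            yield merged


# <Istanbul> x (<Adnan> U <Yasemin>) from the paragraph above.
e = Product((
    Singleton("location", "Istanbul"),
    Union((Singleton("dispatcher", "Adnan"),
           Singleton("dispatcher", "Yasemin"))),
))
print(size(e))            # 3
print(list(tuples(e)))
```

The expression has three singletons, whereas the flat relation it represents stores four values in two tuples.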

F-representations form a representation system for relational
databases. It is complete in the sense that any database
can be represented in this system, but not injective since
there exist different f-representations for the same database.
The space of f-representations of a database is defined by
the distributivity of product ($\times$) over union ($\cup$). Under
the RAM model with uniform cost measure, the tuples of
a given f-representation $E$ over a set $\mathcal{S}$ of attributes can be
enumerated with $O(|E|)$ space and precomputation time,
and $O(|\mathcal{S}|)$ delay between successive tuples.

Factorisation trees define classes of f-representations over 
a set of attributes and with the same nesting structure. 

Definition 2. A factorisation tree, or f-tree for short, over 
a schema S of attributes is an unordered rooted forest with 
each node labelled by a non-empty subset of S such that 
each attribute of S labels exactly one node. 

Given an f-tree T, an f-representation over T is recursively 
defined as follows: 



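• If $\mathcal{T}$ is a forest of trees $\mathcal{T}_1, \ldots, \mathcal{T}_k$, then

$E = E_1 \times \cdots \times E_k$

where each $E_i$ is an f-representation over $\mathcal{T}_i$.

• If $\mathcal{T}$ is a single tree with a root labelled by $\{A_1, \ldots, A_k\}$
and a non-empty forest $\mathcal{U}$ of children, then

$E = \bigcup_a \langle A_1{:}a \rangle \times \cdots \times \langle A_k{:}a \rangle \times E_a$

where each $E_a$ is an f-representation over $\mathcal{U}$ and the
union $\bigcup_a$ is over a collection of distinct values $a$.

• If $\mathcal{T}$ is a single node labelled by $\{A_1, \ldots, A_k\}$, then

$E = \bigcup_a \langle A_1{:}a \rangle \times \cdots \times \langle A_k{:}a \rangle$.

• If $\mathcal{T}$ is empty, then $E = \emptyset$ or $E = \langle\rangle$.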

Attributes labelling the same node in T have equal val- 
ues in the represented relation. The shape of T provides 
a hierarchy of attributes by which we group the tuples of 
the represented relation: we group the tuples by the values 
of the attributes labelling the root, factor out the common 
values, and then continue recursively on each group using 
the attributes lower in the f-tree. Branching into several
subtrees denotes a product of f-representations over the
individual subtrees. Examples 1 and 2 give six f-trees and
f-representations over them.

For a given f-tree $\mathcal{T}$ over a set $\mathcal{S}$ of attributes, not all
relations over $\mathcal{S}$ have an f-representation over $\mathcal{T}$. However,
if a relation admits an f-representation $\Phi$ over $\mathcal{T}$, then $\Phi$ is
unique up to commutativity of union and product.

Example 3. The relation $R = \{(1, 1), (1, 2), (2, 2)\}$ over
schema $\{A, B\}$ does not admit an f-representation over the
forest of f-trees $\{A\}$ and $\{B\}$, since there are no sets of
values $a$ and $b$ such that $R$ is represented by
$(\bigcup_a \langle A{:}a \rangle) \times (\bigcup_b \langle B{:}b \rangle)$. Its f-representation over the
f-tree with root $A$ and child $B$ is
$\langle A{:}1 \rangle \times (\langle B{:}1 \rangle \cup \langle B{:}2 \rangle) \cup \langle A{:}2 \rangle \times \langle B{:}2 \rangle$. □
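The recursive grouping behind Example 3 can be sketched as follows. This is an illustration under simplifying assumptions, not the FDB construction: every f-tree node carries a single attribute, an f-tree is a pair `(attr, children)`, and the hypothetical function `factorise` raises an error when the relation is not a product over the subtrees of a forest.

```python
from math import prod


def tree_attrs(tree):
    """All attributes in a tree given as (attr, children)."""
    attr, children = tree
    return [attr] + [a for c in children for a in tree_attrs(c)]


def project(rows, attrs):
    """Project a set of rows (tuples of (attr, value) pairs) onto attrs."""
    return {tuple(p for p in row if p[0] in attrs) for row in rows}


def factorise(rows, forest):
    """Return [(attr, [(value, sub-representation), ...]), ...] per tree."""
    projections = [project(rows, set(tree_attrs(t))) for t in forest]
    # rows is always contained in the product of its projections, so equal
    # cardinalities mean the relation *is* that product.
    if prod(len(p) for p in projections) != len(rows):
        raise ValueError("relation is not a product over the subtrees")
    parts = []
    for (attr, children), proj in zip(forest, projections):
        union = []
        for a in sorted({dict(r)[attr] for r in proj}):
            group = {tuple(p for p in r if p[0] != attr)
                     for r in proj if dict(r)[attr] == a}
            union.append((a, factorise(group, children)))
        parts.append((attr, union))
    return parts


R = {(("A", 1), ("B", 1)), (("A", 1), ("B", 2)), (("A", 2), ("B", 2))}

# F-tree with root A and child B: grouping by A succeeds.
print(factorise(R, [("A", [("B", [])])]))

# Forest of the two single-node f-trees {A} and {B}: no f-representation.
try:
    factorise(R, [("A", []), ("B", [])])
except ValueError as err:
    print("rejected:", err)
```

The first call returns the nesting $\langle A{:}1 \rangle \times (\langle B{:}1 \rangle \cup \langle B{:}2 \rangle) \cup \langle A{:}2 \rangle \times \langle B{:}2 \rangle$; the second is rejected because the product of the two projections has four tuples while $R$ has three.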

F-trees of a query. Given a query $Q = \pi_{\mathcal{P}} \sigma_{\varphi}(R_1 \times \cdots \times
R_n)$, we can derive the f-trees that define factorisations of
the query result $Q(D)$ for any input database $D$. We consider
f-trees where nodes are labelled by equivalence classes
of attributes in $\mathcal{P}$; the equivalence class of an attribute $A$ is
the set of $A$ and all attributes transitively equal to $A$ in $\varphi$.

In addition, the attributes labelling the nodes have to satisfy
a so-called path constraint: all dependent attributes can
only label nodes along the same root-to-leaf path. The
attributes of a relation are dependent, since in general we cannot
make any independence assumption about the structure
of a relation, cf. Example 3. Attributes from different relations
can also be dependent. If we join two relations, then
their non-join attributes are independent conditioned on the
join attributes. If these join attributes are not in the
projection list $\mathcal{P}$, then the non-join attributes of these relations
become dependent.

The path constraint is key to defining which f-trees rep- 
resent valid nesting structures for factorised query results. 

Proposition 1. Given a query $Q$, an f-tree $\mathcal{T}$ of $Q$ satisfies
the path constraint if and only if for any input database
$D$ the query result $Q(D)$ has an f-representation over $\mathcal{T}$.

Tight size bounds for f-representations over f-trees. 

Given any f-tree T, we can derive tight bounds on the size 
of f-representations over T in polynomial time. 



For any root-to-leaf path $p$ in $\mathcal{T}$, consider the hypergraph
whose nodes are the attribute classes of nodes in $p$ and
whose edges are the relations containing these attributes.
The edge cover number of $p$ is the minimum number of edges
necessary to cover all attributes in $p$. We can lift edge covers
to their fractional version [12]. The fractional edge cover
number is the cost of an optimal solution to the following
linear program with variables $\{x_{R_i}\}_{i=1}^{n}$:

$$
\begin{aligned}
\text{minimise} \quad & \textstyle\sum_i x_{R_i} \\
\text{subject to} \quad & \textstyle\sum_{i \,:\, R_i \in \mathrm{rel}(\mathcal{A})} x_{R_i} \ge 1 && \text{for all attribute classes } \mathcal{A}, \\
& x_{R_i} \ge 0 && \text{for all } i.
\end{aligned}
$$

For each relation $R_i$ with attributes on $p$, its weight is
given by the variable $x_{R_i}$. Each attribute class $\mathcal{A}$ on $p$ has
to be covered by relations $\mathrm{rel}(\mathcal{A})$ with attributes in $\mathcal{A}$ such
that the sum of the weights of these relations is at least
1. The objective is to minimise the sum of the weights of all
relations. In the non-weighted version, the variables $x_{R_i}$ can
only be assigned the values 0 and 1, whereas in the weighted
version, the variables can be any non-negative rational number.

For an f-tree $\mathcal{T}$, we define $s(\mathcal{T})$ as the maximum such
fractional edge cover number of any root-to-leaf path in $\mathcal{T}$.

Example 4. Each f-tree $\mathcal{T}$ except for $\mathcal{T}_3$ in Figure 2 has
$s(\mathcal{T}) = 2$, while $s(\mathcal{T}_3) = 1$. In $\mathcal{T}_3$, both root-to-leaf paths
supplier–item and supplier–location can be covered by
relations Produce and Serve respectively. □
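As a worked instance of the linear program, consider the path item–location–dispatcher in $\mathcal{T}_1$. We assume here, since Figure 1 is not reproduced in the text, that the schemas are Order(oid, item), Store(location, item), and Disp(dispatcher, location), as suggested by the joins in $Q_1$; the weights $x_O$, $x_S$, $x_D$ are notation introduced for this sketch.

```latex
\begin{aligned}
\text{minimise}   \quad & x_O + x_S + x_D \\
\text{subject to} \quad & x_O + x_S \ge 1 && \text{(item, in Order and Store)} \\
                        & x_S + x_D \ge 1 && \text{(location, in Store and Disp)} \\
                        & x_D \ge 1       && \text{(dispatcher, in Disp only)} \\
                        & x_O,\, x_S,\, x_D \ge 0 .
\end{aligned}
% x_D >= 1 and x_O + x_S >= 1 give x_O + x_S + x_D >= 2;
% the assignment x_O = 0, x_S = 1, x_D = 1 attains this bound.
```

The optimum is therefore 2. Under the same assumed schemas, the other path item–oid of $\mathcal{T}_1$ is covered by Order alone with cost 1, so the maximum over both paths is 2, in agreement with $s(\mathcal{T}_1) = 2$ in Example 4.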

For any database $D$ and f-tree $\mathcal{T}$, the size of the
f-representation of the query result over $\mathcal{T}$ is $O(|D|^{s(\mathcal{T})})$, and there
exist arbitrarily large databases $D$ for which the size of the
f-representation over $\mathcal{T}$ is $\Omega(|D|^{s(\mathcal{T})})$. Given $D$ and $\mathcal{T}$,
f-representations of the query result $Q(D)$ over the f-tree $\mathcal{T}$
can be computed in time $O(|Q| \cdot |D|^{s(\hat{\mathcal{T}})})$, where $\hat{\mathcal{T}}$ is an
extension of $\mathcal{T}$ with nodes for all attributes in the input
schema and not in the projection list $\mathcal{P}$; detailed treatment
of this result is given in prior work [19]. More succinct
f-representations thus have a smaller parameter $s(\mathcal{T})$, which
can be obtained by decreasing the length of root-to-leaf
paths in $\mathcal{T}$ and increasing the width of $\mathcal{T}$ while preserving
the path constraint.

We next define $s(Q)$ as the minimal $s(\mathcal{T})$ for any f-tree $\mathcal{T}$
of $Q$. Then, for any database $D$, there is an f-representation
of $Q(D)$ with size at most $|D|^{s(Q)}$, and this is asymptotically
the best upper bound for f-representations over f-trees.

Example 5. In Example 1 we have $s(Q_1) = 2$ since $Q_1$
admits no f-tree with $s(\mathcal{T}) < s(\mathcal{T}_1) = 2$. However, $s(Q_2) =
1$, since $\mathcal{T}_3$ is an f-tree of $Q_2$ and $s(\mathcal{T}_3) = 1$. □

The size bound $|D|^{s(Q)}$ can be asymptotically smaller
than the size of the query result $Q(D)$. For such queries,
computing and representing their result in factorised form
can bring exponential time and space savings in comparison
to the traditional representation as a set of tuples.

Example 6. Consider relations $R_i$ over schemas $\{A_i, B_i\}$
and the query $Q_n = \sigma_{\varphi}(R_1 \times \cdots \times R_n)$, where $\varphi = \bigwedge_{i=1}^{n-1} (B_i =
A_{i+1})$. This is a chain of $n - 1$ equality joins. The result
$Q_n(D)$ can be as large as $|D|^{\Theta(n)}$, while $s(Q_n) = \Theta(\log n)$
and hence there exist factorised representations of $Q_n(D)$
with size at most $|D|^{\Theta(\log n)}$. The value $s(\mathcal{T}) = \Theta(\log n)$ is
witnessed by an f-tree $\mathcal{T}$ with depth $\log n$. □

Figure 3: Transformations performed by f-plan operators depicted on f-trees.



3. QUERY EVALUATION 

In this section we present a query evaluation technique on 
f-representations. We propose a set of operators that map 
between f-representations over f-trees. In addition to the 
standard relational operators select, project, and Cartesian 
product, we introduce new operators that can restructure 
f-representations and f-trees. Restructuring is sometimes 
needed before selections, as exemplified in the introduction. 
Any select-project-join query can be evaluated by a sequen- 
tial composition of operators called an f-plan. 

We consider f-representations over f-trees as defined in
Section 2. F-trees conveniently represent the structure of
factorisations as well as attributes and equality conditions
on the attributes. An f-tree uniquely determines (up to
commutativity of $\cup$ and $\times$) the f-representation of a given
relation. Therefore, the semantics of each of our operators may
be described solely by the transformation of f-trees $\mathcal{T} \mapsto \mathcal{T}'$.
We also present efficient algorithms to carry out the
transformations on f-representations. These algorithms are almost
optimal in the sense that they need at most quasilinear time
in the sizes of both input and output f-representations.

Proposition 2. The time complexity of each f-plan operator
is $O(|\mathcal{T}|^2 N \log N)$, where $N$ is the sum of sizes of the
input and output f-representations and $\mathcal{T}$ is the input f-tree.

We assume that for any union expression $\bigcup_a$ in the input
f-representation, the values $a$ occur in increasing order, and
that the path constraint holds for the input f-tree. Our
algorithms preserve these two constraints.

We also introduce the notion of normalised f-trees, whose 
f-representations cannot be further compacted by factoring 
out subexpressions. We define an operator for normalising 
f-trees, and all other operators expect normalised input f- 
trees and preserve normalisation. 

3.1 Restructuring Operators 

The Normalisation Operator factors out expressions common
to all terms of a union. We first present a simple
one-step normalisation captured by the push-up operator
$\psi_{\mathcal{B}}$, and then normalise an f-tree by repeatedly applying the
push-up operator bottom-up to each node in the f-tree.

Consider an f-tree $\mathcal{T}$, a node $\mathcal{A}$ and its child $\mathcal{B}$ in $\mathcal{T}$. If $\mathcal{A}$
is not dependent on $\mathcal{B}$ nor on its descendants, the subtree
rooted at $\mathcal{B}$ can be brought one level up (so that $\mathcal{B}$ becomes
a sibling of $\mathcal{A}$) without violating the path constraint.
Proposition 1 guarantees that there is an f-representation over the
new f-tree. Lifting up a node can only reduce the length of
root-to-leaf paths in $\mathcal{T}$ and thus decrease the parameter $s(\mathcal{T})$
and the size of the f-representation, cf. Section 2. Since the
transformation only alters the structure of the factorisation,
the represented relation remains unchanged.

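Figure 3(a) shows the transformation of the relevant fragment
of $\mathcal{T}$, where $\mathcal{T}_A$ and $\mathcal{T}_B$ denote the subtrees under $\mathcal{A}$
and $\mathcal{B}$. F-representations over this fragment have the form

$\Phi_1 = \bigcup_a \big( \langle A{:}a \rangle \times (\bigcup_b \langle B{:}b \rangle \times F_b) \times E_a \big)$

and change into

$\Phi_2 = (\bigcup_b \langle B{:}b \rangle \times F_b) \times (\bigcup_a \langle A{:}a \rangle \times E_a)$,

where each $E_a$ is over $\mathcal{T}_A$, each $F_b$ is over $\mathcal{T}_B$, and $\langle A{:}a \rangle$
stands for $\langle A_1{:}a \rangle \times \cdots \times \langle A_n{:}a \rangle$ in case $A_1$ to $A_n$ are the
attributes labelling node $\mathcal{A}$; the case of $\langle B{:}b \rangle$ is similar.
Since neither $\mathcal{B}$ nor any node in $\mathcal{T}_B$ depends on $\mathcal{A}$, all copies
of $(\bigcup_b \langle B{:}b \rangle \times F_b)$ in $\Phi_1$ are equal, so the transformation
amounts to factoring out subexpressions over the subtree
rooted at $\mathcal{B}$. In any f-representation over $\mathcal{T}$, the change
shown above occurs for all unions over $\mathcal{A}$, and can be
executed in linear time in one pass over the f-representation.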

Definition 3. An f-tree T is normalised if no node in T 
can be pushed up without violating the path constraint. 

Any f-tree $\mathcal{T}$ can be turned into a normalised one as
follows. We traverse $\mathcal{T}$ bottom-up and push each node $\mathcal{B}$ and
its subtree upwards as far as possible using the operator $\psi_{\mathcal{B}}$.
Once a node $\mathcal{A}$ has been pushed up as far as possible, we mark
it so that we do not consider it again. If it is marked, so are
all the nodes in its subtree, and at least one of them is
dependent on the parent of $\mathcal{A}$ (or $\mathcal{A}$ is a root). The parent of
$\mathcal{A}$ and its subtree do not change anymore after $\mathcal{A}$ is marked,
so $\mathcal{A}$ cannot be brought upwards again. All nodes are marked
after at most $|\mathcal{T}|^2$ applications of the push-up operator, so the
resulting f-tree is normalised. Since the size of the
f-representation over $\mathcal{T}$ decreases with each push-up, the time
complexity of normalising an f-representation is linear in the
size of the input f-representation. This procedure defines the
normalisation operator $\eta$. In the remainder we only consider
normalised f-trees and operators that preserve normalisation.

Example 7. Let us normalise the left f-tree below with
relations over schemas $\{A, B\}$, $\{B', C\}$, $\{C', D\}$, $\{D', E\}$.

    left              middle              right

    B,B'              B,B'                B,B'
     |                 |                 /    \
     A                 A              D,D'     A
     |                 |              /  \
    D,D'              D,D'           E   C,C'
     |                /  \
    C,C'             E   C,C'
     |
     E

    left $\mapsto$ middle by $\psi_E$; middle $\mapsto$ right by $\psi_{\{D,D'\}}$

The above transformation is obtained by $\psi_E$ followed by
$\psi_{\{D,D'\}}$. We can bring up $E$ since it is not dependent on
its parent in the left f-tree. We then mark $E$. We also mark
$\{C, C'\}$, since it cannot be brought upwards. The lowest
unmarked node is now $\{D, D'\}$. It can be brought upwards
next to its parent $A$ since $A$ is not dependent on it nor on
any of its descendants. The resulting f-tree is normalised. □
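The bottom-up procedure of Example 7 can be sketched on f-trees alone. The code below is an illustration, not the FDB implementation. It assumes a query without projection, so that two nodes are dependent exactly when some relation has attributes in both; the parent-pointer encoding and the function name `normalise` are choices made for this sketch.

```python
def normalise(parent, relations):
    """parent maps each node (a frozenset of attributes) to its parent or None."""
    parent = dict(parent)

    def depth(n):
        d = 0
        while parent[n] is not None:
            n, d = parent[n], d + 1
        return d

    def subtree(n):
        nodes = {n}
        for m, p in parent.items():
            if p == n:
                nodes |= subtree(m)
        return nodes

    def dependent(x, y):
        # Assumption: dependency means sharing a relation (no projection).
        return any(x & r and y & r for r in relations)

    # Deepest nodes first; depths are taken from the input f-tree.
    for node in sorted(parent, key=depth, reverse=True):
        # Push-up: lift node while its parent depends on nothing below node.
        while parent[node] is not None and not any(
            dependent(parent[node], d) for d in subtree(node)
        ):
            parent[node] = parent[parent[node]]
    return parent


BB, A = frozenset({"B", "B'"}), frozenset({"A"})
DD, CC, E = frozenset({"D", "D'"}), frozenset({"C", "C'"}), frozenset({"E"})
relations = [{"A", "B"}, {"B'", "C"}, {"C'", "D"}, {"D'", "E"}]

# Left f-tree of Example 7: the chain B,B' - A - D,D' - C,C' - E.
left = {BB: None, A: BB, DD: A, CC: DD, E: CC}
right = normalise(left, relations)
print(right[E] == DD, right[CC] == DD, right[DD] == BB, right[A] == BB)
```

On the left f-tree this lifts $E$ next to $\{C, C'\}$ and then $\{D, D'\}$ next to $A$, which yields the right f-tree of the example.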

The Swap Operator $\chi_{\mathcal{A},\mathcal{B}}$ exchanges a node $\mathcal{B}$ with its
parent node $\mathcal{A}$ in $\mathcal{T}$ while preserving the path constraint
and normalisation of $\mathcal{T}$. We promote $\mathcal{B}$ to be the parent
of $\mathcal{A}$, and also move up its children that do not depend
on $\mathcal{A}$. The effect of the swap operator $\chi_{\mathcal{A},\mathcal{B}}$ on the
relevant fragment of $\mathcal{T}$ is shown in Figure 3(b), where $\mathcal{T}_B$
and $\mathcal{T}_{AB}$ denote the collections of children of $\mathcal{B}$ that do not
depend, and respectively depend, on $\mathcal{A}$, and $\mathcal{T}_A$ denotes the
subtree under $\mathcal{A}$. Separate treatment of the subtrees $\mathcal{T}_B$
and $\mathcal{T}_{AB}$ is required so as to preserve the path constraint
and normalisation. The resulting f-tree has the same nodes
as $\mathcal{T}$ and the represented relation remains unchanged.

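Any f-representation over the relevant part of the input
f-tree $\mathcal{T}$ in Figure 3(b) has the form

$\bigcup_a \big( \langle A{:}a \rangle \times E_a \times \bigcup_b ( \langle B{:}b \rangle \times F_b \times G_{ab} ) \big)$,

while the corresponding restructured f-representation is

$\bigcup_b \big( \langle B{:}b \rangle \times F_b \times \bigcup_a ( \langle A{:}a \rangle \times E_a \times G_{ab} ) \big)$.

The expressions $E_a$, $F_b$ and $G_{ab}$ denote the f-representations
over the subtrees $\mathcal{T}_A$, $\mathcal{T}_B$ and respectively $\mathcal{T}_{AB}$.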

The swap operator $\chi_{\mathcal{A},\mathcal{B}}$ thus takes an f-representation
where data is grouped first by $\mathcal{A}$ then $\mathcal{B}$, and produces an
f-representation grouped by $\mathcal{B}$ then $\mathcal{A}$. Figure 4 gives an
algorithm for $\chi_{\mathcal{A},\mathcal{B}}$ that executes this regrouping efficiently.
We use a priority queue $Q$ to keep for each value $a$ of
attributes in $\mathcal{A}$ the minimal values $b$ of attributes in $\mathcal{B}$. This
minimal value occurs first in the union $U_a$ due to the order
constraint of f-representations. We then extract the values
$b$ from the priority queue $Q$ in increasing order to construct
the union over them, and for each of them we obtain the
pairing values $a$. When a value $a$ is removed from $Q$, we
insert it back into $Q$ with the next value $b$ in its union $U_a$.

Except for the operations on the priority queue, the total
time taken by the algorithm in any given iteration of the
outermost loop is linear in the size of the input $S_{in}$ plus the size
of the output $S_{out}$. For each $a$ in $S_{in}$ and $b$ in $U_a$, the value
$a$ is inserted into the queue with key $b$ once and removed
once. There are at most $|S_{in}|$ such pairs $(a, b)$ and each of
the priority queue operations runs in time $O(\log |S_{in}|)$.

Example 8. The tree $\mathcal{T}_1$ in Figure 2 is transformed into
$\mathcal{T}_2$ by the operator $\chi_{\text{item},\text{location}}$. The effect of the operator
on the f-representation amounts to regrouping it primarily
by location instead of item, as illustrated in Example 1. □

3.2 Cartesian Product Operator 

Given two f-representations $E_1$ and $E_2$ over disjoint sets of
attributes, the product operator $\times$ yields the f-representation
$E = E_1 \times E_2$ over the union of the sets of attributes of $E_1$
and $E_2$ in time linear in the sum of the sizes of $E_1$ and $E_2$.
If $\mathcal{T}_1$ and $\mathcal{T}_2$ are the input f-trees, then the resulting f-tree is
the forest of $\mathcal{T}_1$ and $\mathcal{T}_2$. It is easy to check that the relation
represented by $E$ is indeed the product of the relations of
$E_1$ and $E_2$, and that this operator preserves the constraints
on order of values, path constraint, and normalisation.



foreach expression $S_{in}$ over the part of T in Figure 3(b) do
    create a new union $S_{out}$
    let Q be a min-priority-queue
    foreach $\langle A{:}a\rangle \times E_a \times \bigcup_b (\langle B{:}b\rangle \times F_b \times G_{ab})$ in $S_{in}$ do
        let $U_a$ be the union $\bigcup_b (\langle B{:}b\rangle \times F_b \times G_{ab})$
        let $p_a$ be the first value b in the union $U_a$
        insert value a with key $p_a$ into Q
    while Q is not empty do
        let $b_{min}$ be the minimum key in Q
        create a new union $V_{b_{min}}$
        foreach a in Q with key $b_{min}$ do
            append $\langle A{:}a\rangle \times E_a \times G_{a b_{min}}$ to $V_{b_{min}}$
            remove a from Q
            if $p_a$ is not the last value in $U_a$ then
                update $p_a$ to be the next value b in the union $U_a$
                insert value a with key $p_a$ into Q
        append $\langle B{:}b_{min}\rangle \times F_{b_{min}} \times V_{b_{min}}$ to $S_{out}$
    replace $S_{in}$ by $S_{out}$

Figure 4: Algorithm for the swap operator $\chi_{A,B}$.
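
The regrouping of Figure 4 can be sketched in Python for a single input union. This is an illustration only, not FDB's implementation: the list-of-tuples encoding and the function name `swap` are assumptions made for this example, and the subexpressions E_a, F_b and G_ab are treated as opaque values.

```python
import heapq


def swap(s_in):
    """Regroup a union grouped by A then B into one grouped by B then A.

    s_in: list of (a, E_a, U_a) sorted by a, where U_a is a list of
          (b, F_b, G_ab) sorted by b (assumed encoding).
    Returns a list of (b, F_b, V_b) sorted by b, where V_b is a list of
    (a, E_a, G_ab) sorted by a.
    """
    # Queue entries are (key b, index of a in s_in, position inside U_a).
    # Using the index of a as tie-breaker keeps the a-values of each V_b sorted.
    queue = [(u_a[0][0], i, 0) for i, (_, _, u_a) in enumerate(s_in) if u_a]
    heapq.heapify(queue)

    s_out = []
    while queue:
        b_min = queue[0][0]
        f_b, v_b = None, []
        # Collect every a whose current minimal b equals b_min.
        while queue and queue[0][0] == b_min:
            _, i, pos = heapq.heappop(queue)
            a, e_a, u_a = s_in[i]
            _, f_b, g_ab = u_a[pos]
            v_b.append((a, e_a, g_ab))
            # Re-insert a with the next value b of its union, if there is one.
            if pos + 1 < len(u_a):
                heapq.heappush(queue, (u_a[pos + 1][0], i, pos + 1))
        s_out.append((b_min, f_b, v_b))
    return s_out
```

Each pair (a, b) enters and leaves the queue exactly once, which matches the accounting used in the complexity argument above.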

3.3 Selection Operators 

We next present operators for selections with equality con- 
ditions of the form A = B. Since equi-joins are equivalent 
to equality selections on top of products, and the product 
of f-representations is just their concatenation, we can eval- 
uate equality joins in the same way as equality conditions 
on attributes of the same relation, and do not distinguish 
between these two cases in the sequel. 

If both attributes A and B label the same node in T,
then by construction of T the two attributes are in the same
equivalence class, and hence the condition A = B already
holds. If A and B label two distinct nodes of an f-tree T,
the condition A = B implies that these two nodes should be
merged into a single node labelled by the union of their
equivalence classes.

We propose two selection operators: the merge operator
$\mu_{A,B}$, which can only be applied in case A and B are sibling
nodes in T, and the absorb operator $\alpha_{A,B}$, which can only be
applied in case A is an ancestor of B in T. For all other cases
of A and B in T, we first need to apply the swap operator
until we transform T into one of the above two cases. The
reason for supporting these selection operators only is that
they are simple, atomic, can be implemented very efficiently,
and any selection can be expressed by a sequence of swaps
and selection operators. We next discuss them in depth.

The Merge Selection Operator $\mu_{A,B}$ merges the sibling
nodes A and B of T into one node labelled by the attributes
of A and B and whose children are those of A and B, see
Figure 3(c). This operator preserves the path constraint,
since the root-to-leaf paths in T are preserved in the
resulting f-tree. Also, normalisation is preserved: merging two
nodes of a normalised f-tree produces a normalised f-tree.
To preserve the value order constraint, node merging is
implemented as a sort-merge join. Any f-representation over
the relevant part of T has the form

$$\Phi_1 = \Big(\bigcup_a \langle A{:}a\rangle \times E_a\Big) \times \Big(\bigcup_b \langle B{:}b\rangle \times F_b\Big),$$

and changes into

$$\Phi_2 = \bigcup_{a,b:\, a=b} \langle A{:}a\rangle \times \langle B{:}b\rangle \times E_a \times F_b,$$

where the union in $\Phi_2$ is over the equal values a and b of the
unions in $\Phi_1$. An algorithm for $\mu_{A,B}$ needs one pass over the
input f-representation to identify expressions like $\Phi_1$, and for
each such expression it computes a standard sort-merge join
on the sorted lists of values of these unions.

Example 9. Consider an f-tree that is the forest of 7i and 
71 from Figure [21 The two attributes with the same name 
item are siblings (at the topmost level). By merging them, 
we obtain the f-tree Tz,. Example [l] shows f- representations 
over the input and output f-trees of this merge operation. □ 
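
For one pair of sibling unions, the sort-merge step of the merge operator can be sketched as follows. The encoding of a union as a sorted list of (value, subexpression) pairs and the name `merge` are assumptions of this example; the subexpressions are opaque.

```python
def merge(union_a, union_b):
    """Sort-merge join of two sibling unions on equal values.

    union_a: list of (a, E_a) sorted by a, with distinct values.
    union_b: list of (b, F_b) sorted by b, with distinct values.
    Returns a list of (v, E_v, F_v) for the values v present in both
    unions, again sorted by v.
    """
    merged = []
    i = j = 0
    while i < len(union_a) and j < len(union_b):
        a, e_a = union_a[i]
        b, f_b = union_b[j]
        if a < b:
            i += 1          # a has no partner, drop it
        elif a > b:
            j += 1          # b has no partner, drop it
        else:
            merged.append((a, e_a, f_b))
            i += 1
            j += 1
    return merged
```

Because both inputs are sorted and the output is produced in increasing order of the common value, the order constraint holds for the result without any extra sorting.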

The Absorb Selection Operator $\alpha_{A,B}$ absorbs a node B
into its ancestor A in an f-tree T, and then normalises the
resulting f-tree. The labels of B now become labels of A.

The absorption of B into A preserves the path constraint
since all attributes in B remain on the same root-to-leaf
paths. By definition, the absorb operator finishes with a
normalisation step, thus it preserves the normalisation
constraint. Similar to the merge selection operator, it employs
sort-merge join on the values of A and B and hence creates
f-representations that satisfy the order constraint.

In any f-representation, each union over B is inside a union
over its ancestor A, and hence inside a product with a
particular value a of A. Enforcing the constraint A = B amounts
to restricting each such union over B by B = a, by which
it remains with only one or zero subexpressions. This can
be executed in one pass over the f-representation, and needs
linear time in the input size. The subsequent normalisation
also takes linear time. Both the absorption and the
normalisation only decrease the size of the resulting f-representation.
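
The restriction step can be sketched for the simplest case, in which B is a child of A. The encoding and the name `absorb` are assumptions of this example; deeper descendants and the final normalisation step are not modelled.

```python
def absorb(s_in):
    """Enforce A = B on a union over A whose terms contain a union over B.

    s_in: list of (a, E_a, U_a) sorted by a, where U_a is a list of
          (b, F_b) sorted by b (assumed encoding, B a child of A).
    Returns a list of (a, E_a, F_a) for the values a that also occur
    in their own union U_a.
    """
    s_out = []
    for a, e_a, u_a in s_in:
        for b, f_b in u_a:
            if b == a:
                # The union over B is restricted to the single value a.
                s_out.append((a, e_a, f_b))
                break
            if b > a:
                # U_a is sorted, so a cannot occur later; the restricted
                # union is empty and the whole a-term disappears.
                break
    return s_out
```

Every input singleton is inspected at most once, and the output never contains more singletons than the input.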

For normalising the f-tree after merging B into A, we can
use the normalisation operator $\eta$ as described above.
However, if the original tree was normalised, it is sufficient to
push up the subtrees of B as shown in Figure 3(d), but we
may also need to push upwards some of the nodes $C_1, \ldots, C_k$
on the path between A and B.

Example 10. Consider the selection A = C on the leftmost
f-tree below with relations over schemas {A, B}, {B', C}
and {C', D}. Since A and C correspond to ancestor and
respectively descendant nodes, we can use the absorb
operator to enforce the selection. When absorbing {C, C'} into
A (middle f-tree), the nodes {B, B'} and D become
independent and D can be pushed upwards (right f-tree):

The Selection with Constant Operator $\sigma_{A\theta c}$ can be
evaluated in one pass over the input f-representation E.
Whenever we encounter a union $\bigcup_a (\langle A{:}a\rangle \times E_a)$ in E, we
remove all expressions $\langle A{:}a\rangle \times E_a$ for which $a\,\theta\,c$ does
not hold. If the union becomes empty and appears in a product with
another expression, we then remove that expression too and
continue until no more expressions can be removed. In case
$\theta$ is an equality comparison, then all remaining A-values are
equal to c and we can factor out the singleton $\langle A{:}c\rangle$.

For a comparison $\theta$ different from equality, the f-tree
remains unchanged. In case of equality, we can infer that all
A-values in the f-representation are equal to c and thus the
node labelled by A is independent of the other nodes in
the f-tree and can be pushed up as the new root. When
computing the parameter s(T), we can ignore A since the
only f-representation over it is the singleton $\langle A{:}c\rangle$.
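
The one-pass evaluation with upward propagation of empty unions can be sketched recursively. The nested encoding (a union is a list of terms, a term is a triple of attribute, value and a list of child unions forming a product) and the name `select_const` are assumptions of this example. Factoring out the singleton in the equality case is not shown.

```python
import operator


def select_const(union, attr, theta, c):
    """Keep only the parts of a union that satisfy `attr theta c`.

    union: list of (attribute, value, children), where children is a list
           of unions whose product hangs below the singleton.
    theta: a binary predicate such as operator.lt or operator.eq.
    """
    kept = []
    for name, value, children in union:
        if name == attr and not theta(value, c):
            continue  # the singleton itself violates the condition
        new_children = [select_const(ch, attr, theta, c) for ch in children]
        if any(len(ch) == 0 for ch in new_children):
            continue  # a product with an empty union is empty
        kept.append((name, value, new_children))
    return kept
```

For instance, `select_const(rep, "B", operator.lt, 6)` removes every B-singleton with value 6 or larger, and then every A-term whose union over B has become empty.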

3.4 Projection Operator 

Given an f-representation E, the projection operator $\pi_{\bar{A}}$,
where $\bar{A}$ is a list of attributes of E, replaces singletons $\langle B{:}b\rangle$
of type $B \notin \bar{A}$ with the empty singleton $\langle\rangle$. If an empty
singleton appears in a product with other singletons, then it
can be removed from E. Also, a union of empty singletons
is replaced by one empty singleton. This procedure can be
performed in one scan over the input f-representation E and
trivially preserves the order constraint.

We transform the input f-tree as follows. We first mark
those attributes that are projected away without removing
them from the f-tree. The set of attributes of an f-tree would
then exclude the marked attributes. If a leaf node has all
attributes marked, we may then remove the node and its
attributes from the f-tree. This process is repeated until no
more nodes can be removed. We do not remove inner nodes
with all attributes marked for the following reason. Consider
the f-tree T representing a path A–B–C with dependency
sets {A, B} and {B, C}. Now assume that we project
away the attribute B. If we completely removed B
from T, the nodes A and C would become independent in
the resulting f-tree, and we could then normalise it into a
forest of nodes A and C. However, this is not correct: the
nodes A and C still remain transitively dependent on each
other. We therefore swap nodes such that those with all
attributes marked become leaves, in which case we can remove
them as explained above. The projection operator trivially
preserves the path constraint and normalisation.
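
The repeated removal of fully marked leaves can be sketched on a small f-tree encoding. The dictionaries `children` and `labels` and the name `prune_projected` are assumptions of this example. The swaps that turn fully marked inner nodes into leaves are not modelled, so such inner nodes are simply kept.

```python
def prune_projected(children, labels, projection):
    """Remove leaves whose attributes are all projected away.

    children:   dict node -> list of child nodes (assumed f-tree encoding).
    labels:     dict node -> set of attributes labelling the node.
    projection: set of attributes that are kept by the projection.
    Returns a new children dict; inner nodes are never removed.
    """
    children = {node: list(kids) for node, kids in children.items()}
    changed = True
    while changed:
        changed = False
        for node in list(children):
            is_leaf = not children[node]
            fully_marked = not (labels[node] & projection)
            if is_leaf and fully_marked:
                del children[node]
                for kids in children.values():
                    if node in kids:
                        kids.remove(node)
                changed = True
    return children
```

On the path A–B–C from the text, projecting away only B leaves the tree unchanged, whereas projecting away C (or B and C) removes nodes from the bottom up.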

4. QUERY OPTIMISATION 

In this section, we discuss the problem of query optimisation
for queries on f-representations. In addition to the
optimisation objective present in the standard (flat) relational
case, namely finding a query plan with minimal cost, the
nature of factorised data calls for a new objective: from the
space of equivalent f-representations for the query result, we
would like to find a small, ideally minimal, f-representation.

The operators described in Section 3 can be composed
to define more complex transformations of f-representations 
over f-trees. Any select-project-join query can be evaluated 
by executing a sequence of these operators. Such a sequence 
of operators is called an f-plan and several f-plans may exist 
for a given query. In this section we introduce different cost 
measures for f-plans and algorithms for finding optimal ones. 

The products and selections with constant are the cheap- 
est on f-representations and can be evaluated first using the 
corresponding operators. Projection can only be evaluated 
when the nodes with no projection attributes are leaves of 
the f-tree, and in FDB they are deferred until the end. Most 
expensive are the equality selection operators and the re- 
structuring operators which make selections and projections 
possible. Their evaluation order is addressed in this section. 

A selection A = B can only be executed on an f-representation
over an f-tree T if the attributes A and B label nodes
that are either the same, siblings, or along the same path
in T. Otherwise, we first need to transform the
f-representation. If A and B are in the same tree,
we can e.g. repeatedly swap A with its parent until it
becomes an ancestor of B. If A and B are in disjoint trees of
T (recall that T may be a forest), we can promote both of
them to roots of their respective trees by repeatedly swapping
nodes, so that they become siblings at the topmost level of the
f-tree. To complete the evaluation, we apply a merge or
absorb selection operator on the two nodes A and B.

There are several choices involved in the evaluation of 
a conjunction of selection conditions: For each selection, 
should we transform the input f-tree, and consequently the 
f-representation, such that the nodes A and B become sib- 
lings or one the ancestor of the other? Is it better to push 
up A or B? What is the effect of a transformation for one 
selection on the remaining selections? The aim of FDB's 
optimiser is to find an f-plan for the given query such that 
the maximal cost of the sequence of transformations is low 
and the query result is well-factorised. 

4.1 Cost of an F-Plan

We next define two cost measures for f-plans. One measure
is based on the parameter s(T) that defines size bounds
on factorisations over f-trees for any input database. The
second measure is based on cardinality estimates inferred
from the intermediate f-trees and catalogue information about
the database, such as relation sizes and selectivity estimates.
Both measures can be used by the exhaustive search procedure
and the greedy heuristic for query optimisation presented
later in this section.

Cost Based on Asymptotic Bounds. As discussed in
Section 2, the size of any f-representation over an f-tree T
depends exponentially on the parameter s(T), i.e., the size is
in $O(|D|^{s(T)})$. Since the cost of each operator is quasilinear
in the sum of sizes of its input and output, the parameter
s(T) dictates it. For an f-plan f consisting of operations
$\omega_1, \ldots, \omega_k$ that transform f-representations and their f-trees:

$$\mathcal{T}_{initial} = \mathcal{T}_0 \xrightarrow{\omega_1} \mathcal{T}_1 \xrightarrow{\omega_2} \cdots \xrightarrow{\omega_k} \mathcal{T}_k = \mathcal{T}_{final},$$

the evaluation time is $O(|D|^{s(f)} \cdot \log |D|)$, where
$s(f) = \max(s(\mathcal{T}_0), s(\mathcal{T}_1), \ldots, s(\mathcal{T}_k))$.

The sizes of the intermediate f-representations thus dominate
the execution time. Using this cost measure, a good
f-plan is one whose intermediate f-trees $\mathcal{T}_i$ have small $s(\mathcal{T}_i)$.

In defining a notion of optimality for f-plans, we would
like to optimise for two objectives, namely minimise s(f)
and $s(\mathcal{T}_{final})$. However, it might not be possible to optimise
for both objectives $<_{max}$ and $<_{s(T)}$ at the same time.
Instead, we settle for an order on these objectives. We define the
lexicographic order $<_{max} \times <_{s(T)}$ on f-plans consisting of
the following orders:

1. $f_1 <_{max} f_2$ holds if $s(f_1) < s(f_2)$, and

2. $f_1 <_{s(T)} f_2$ holds if $s(\mathcal{T}_1) < s(\mathcal{T}_2)$, where $\mathcal{T}_1$ and $\mathcal{T}_2$
are the f-trees of the query result computed by $f_1$ and
$f_2$ respectively.

Given f-plans $f_1$ and $f_2$, we consider $f_1$ better than $f_2$ and
write $f_1 <_{max} \times <_{s(T)} f_2$ if either (1) the most expensive
operator in $f_1$ is less expensive than the most expensive
operator in $f_2$, or (2) their most expensive operators have
the same cost but the cost of the result is smaller for $f_1$. An
f-plan $f_1$ for a query Q is optimal if there is no other f-plan
$f_2$ for Q such that $f_2 <_{max} \times <_{s(T)} f_1$.

This notion of optimality is over f-plans consisting of
operators defined in Section 3. Since these operators preserve
f-tree normalisation, this also means that we consider
optimality only over the space of possible normalised f-trees.

Example 11. Consider the following f-plan evaluating the 
selection B = F on the leftmost f-tree, with dependency sets 
{A,B,C} and {D,E,F}. 

The input f-tree and the output f-tree both have cost 1, as
each root-to-leaf path is covered by a single relation.
However, the intermediate f-tree has cost 2 (as on the path from
B to F each of B and F must be covered by a separate
relation), so the cost of the f-plan is 2. An alternative f-plan
starts by swapping F with its parent to obtain an
intermediate f-tree with cost 1, and then merges F with B.

Although both f-plans result in an f-tree with cost 1, the
latter f-plan has cost 1 while the former has cost 2.
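
The lexicographic comparison can be made concrete with the costs from Example 11. In the sketch below an f-plan is represented only by the list of values s(T_0), ..., s(T_k) of its f-trees; this representation and the name `plan_cost` are assumptions of the example. Python compares tuples lexicographically, which mirrors the order defined above.

```python
def plan_cost(tree_costs):
    """Return the pair (s(f), s(T_final)) for an f-plan.

    tree_costs: the values s(T_0), ..., s(T_k) of the f-trees visited by
                the f-plan, in order (assumed representation).
    """
    return (max(tree_costs), tree_costs[-1])


# Example 11: input, intermediate and output f-trees of both f-plans.
first_plan = [1, 2, 1]        # intermediate f-tree has cost 2
alternative_plan = [1, 1, 1]  # intermediate f-tree has cost 1

# The alternative f-plan is better: same result cost, cheaper intermediate.
assert plan_cost(alternative_plan) < plan_cost(first_plan)
```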

Cost Based on Estimates. We can also estimate the
cost of an f-plan computing the factorised query result for
a query Q and database D using cardinality and selectivity
estimates for D.

Given an f-representation E over an f-tree T of a query Q
and an attribute A in T, the number of A-singletons in E is
given by the size of the result of a query $Q_{anc(A)}$ on the input
database D. This query is $\pi_{anc(A)}(Q)$, where anc(A) is the
set of attributes labelling nodes from the root to the node
of A in T [19]. For instance, in Example 1 the number of
occurrences of any dispatcher in the first f-representation
over the f-tree $\mathcal{T}_1$ is the number of combinations of values
for item-location-dispatcher in the query result.

The size of the factorisation E is then $\sum_{A \in \mathcal{P}} |Q_{anc(A)}(D)|$
over all attributes A in the projection list $\mathcal{P}$ of Q. The
cardinality of $Q_{anc(A)}(D)$ can now be estimated using known
techniques for relational databases, e.g., [21]. The cost s(f)
of an f-plan f can be estimated as the sum of the cost
estimates of the intermediate and final f-trees.

Given an f-tree T and database estimates, we need
polynomial time in T to find s(T) using linear programming and
to compute the size estimate.
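
To illustrate the size formula, the sketch below counts singletons exactly on a small materialised query result instead of estimating cardinalities, so it is an illustration of the formula and not the estimator used by the optimiser. It assumes an f-tree with one attribute per node, given as a parent map, and assumes that the f-tree is valid for the relation; the name `factorisation_size` is hypothetical.

```python
def factorisation_size(tuples, parent):
    """Number of singletons of the factorisation of `tuples` over an f-tree.

    tuples: list of dicts mapping attribute -> value (flat query result).
    parent: dict attribute -> parent attribute, or None for a root
            (assumed encoding, one attribute per node).
    """
    total = 0
    for attr in parent:
        # anc(A): attributes on the path from the root down to A itself.
        anc = []
        node = attr
        while node is not None:
            anc.append(node)
            node = parent[node]
        # |pi_anc(A)(Q(D))|: distinct combinations of values over anc(A).
        total += len({tuple(t[x] for x in anc) for t in tuples})
    return total


# A product {1,2} x {1,2,3}: 12 values when flat, 5 singletons over a forest.
result = [{"A": a, "B": b} for a in (1, 2) for b in (1, 2, 3)]
print(factorisation_size(result, {"A": None, "B": None}))
print(factorisation_size(result, {"A": None, "B": "A"}))
```

The second call uses a path-shaped f-tree that ignores the independence of A and B, and the count grows accordingly.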

4.2 Exhaustive Search 

To find an optimal f-plan for an equi-join query we search
the space of all possible normalised f-trees and all possible
operators between the f-trees (thus represented as a directed
graph where f-trees are nodes and operators are edges). An
f-plan for a given selection query Q on an input over an
f-tree $\mathcal{T}_{in}$ is any path f from $\mathcal{T}_{in}$ to some final f-tree $\mathcal{T}_{final}$
such that (1) the equivalence classes of $\mathcal{T}_{final}$ are the classes
of $\mathcal{T}_{in}$ joined by the query equalities.

Figure 5: Experiment 1: Query optimisation on flat data, K equalities on R relations with A = 40 attributes. Panel titles: finding an optimal f-tree for a random query on R relations; average costs of optimal f-trees for queries on R relations.

The cost function s(f) defines a distance function on the
space of f-trees: the distance from $\mathcal{T}_1$ to $\mathcal{T}_2$ is the minimum
possible cost s(f) of an f-plan from $\mathcal{T}_1$ to $\mathcal{T}_2$. We are thus
searching for f-trees $\mathcal{T}_{final}$ which satisfy (1), are closest to
$\mathcal{T}_{in}$ (2), and have smallest possible cost (3). We can use
Dijkstra's algorithm to find distances of all f-trees from $\mathcal{T}_{in}$:
explore the space starting with $\mathcal{T}_{in}$ and trying all
allowed operators, processing the reached f-trees in the order
of increasing distance from $\mathcal{T}_{in}$. Then, among all f-trees
satisfying (1), we pick those with the shortest distance from $\mathcal{T}_{in}$,
and among these we pick one with smallest cost. Then we
output a shortest path from $\mathcal{T}_{in}$ to this f-tree.
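
Since the cost of an f-plan is the maximum over its f-trees, the search is a Dijkstra-style traversal in which path lengths are combined with max instead of a sum. The sketch below shows this on an abstract graph. The callbacks `neighbours`, `cost` and `is_final` are hypothetical stand-ins for applying operators, computing s(T) and checking condition (1); f-trees are opaque hashable values.

```python
import heapq


def best_final_tree(start, neighbours, cost, is_final):
    """Find a final node with minimal bottleneck distance, then minimal cost.

    The length of a path is the maximum cost of the nodes on it.
    Returns (distance, path) or None if no final node is reachable.
    """
    dist = {start: cost(start)}
    prev = {start: None}
    counter = 0                       # tie-breaker, nodes are never compared
    queue = [(dist[start], counter, start)]
    done = set()
    while queue:
        d, _, node = heapq.heappop(queue)
        if node in done:
            continue
        done.add(node)
        for nxt in neighbours(node):
            nd = max(d, cost(nxt))    # bottleneck instead of sum
            if nxt not in dist or nd < dist[nxt]:
                dist[nxt] = nd
                prev[nxt] = node
                counter += 1
                heapq.heappush(queue, (nd, counter, nxt))
    finals = [n for n in dist if is_final(n)]
    if not finals:
        return None
    best = min(finals, key=lambda n: (dist[n], cost(n)))
    path = []
    node = best
    while node is not None:
        path.append(node)
        node = prev[node]
    return dist[best], path[::-1]
```

Replacing the sum by max keeps the greedy argument of Dijkstra's algorithm valid, because extending a path can never decrease its length.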

The complexity of the search is determined by the size of
the search space. By successively applying operators to $\mathcal{T}_{in}$,
we rearrange its nodes (swap operator) or merge pairs of its
nodes (merge and absorb operators). For each partition of
attributes over nodes, there will be a cluster of f-trees with
the same nodes but different shape, among which we can
move (transitively) using the swap operator. By applying a
merge or absorb operator, we move to a cluster whose f-trees
have one fewer node. Since we can never split a node in two,
any valid f-plan will only merge nodes which end up merged
in $\mathcal{T}_{final}$. For a query with k equality selections, there are
at most $\binom{2k}{2} = O(k^2)$ pairs of nodes we may merge and
we perform at most k merges, so there are $O(k^{2k})$ reachable
clusters. In a cluster with m nodes there are at most $m^m$
f-trees. Since m will always be at most the size n of the
initial f-tree $\mathcal{T}_{in}$, the size of the search space is $O(k^{2k} n^n)$.

4.3 Greedy Heuristic 

Our greedy optimisation algorithm restricts the search 
space for f-plans in two dimensions: (1) it only applies re- 
structuring operators to nodes that participate in selection 
conditions, and (2) it considers a standard greedy approach 
to join ordering, whereby at each step it chooses a join with 
the least cost from the remaining joins. 

The algorithm constructs an f-plan f for a conjunction of
equality conditions as follows. For each condition involving
two attributes labelling nodes A and B, we consider three
possible restructuring scenarios: swapping one of the nodes
A and B until A becomes the ancestor of B or the other
way around, or bringing both A and B upwards until they
become siblings. We choose the cheapest f-plan for each
condition. This f-plan involves restructuring followed by a
selection operator to perform the condition. We then order
the conditions by the cost of their f-plans. The condition
with the cheapest f-plan is performed first and its f-plan
is appended to the overall f-plan f. We then repeat this
process with the remaining conditions until none are left.
The new input f-tree is now the resulting f-tree of the f-plan
of the previously chosen condition.
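
The control flow of the greedy heuristic can be sketched independently of how f-trees are restructured. In the sketch below, `options` is a hypothetical callback that returns, for the current f-tree and one condition, the candidate f-plans (for example the three restructuring scenarios) as triples of cost, operator list and resulting f-tree. Everything about f-trees themselves is left abstract.

```python
def greedy_fplan(tree, conditions, options):
    """Build an f-plan by repeatedly performing the cheapest condition.

    tree:       the input f-tree (opaque value).
    conditions: the equality conditions still to be performed.
    options:    callback (tree, condition) -> list of
                (cost, operators, new_tree) candidates (assumed interface).
    Returns (plan, final_tree), where plan is a flat list of operators.
    """
    plan = []
    remaining = list(conditions)
    while remaining:
        best = None
        for cond in remaining:
            for cost, ops, new_tree in options(tree, cond):
                if best is None or cost < best[0]:
                    best = (cost, cond, ops, new_tree)
        if best is None:
            raise ValueError("no candidate f-plan for the remaining conditions")
        _, cond, ops, tree = best     # the result becomes the new input f-tree
        plan.extend(ops)
        remaining.remove(cond)
    return plan, tree
```

Costs are re-evaluated in every round because performing one condition changes the f-tree on which the remaining conditions are planned.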

In contrast to the full search algorithm, this greedy
algorithm takes only polynomial time in the size of the input
f-tree T. For each condition, there can be at most O(|T|)
swaps and each swap requires looking at all descendants of
the swapped nodes to check for independence. Computing
the resulting f-tree in each of the three restructuring cases
would then need $O(|T|^2)$.



5. EXPERIMENTAL EVALUATION 

We evaluate the performance of our query engine FDB 
against three relational engines: one homebred in-memory 
(RDB) and two open-source engines (SQLite and Postgre- 
SQL). Our main finding is that FDB clearly outperforms 
relational engines for data sets with many-to-many relation- 
ships. In particular, in our experiments we found that: 

• The size of factorised query results is typically at most
quadratic in the input size for queries of up to eight
relations and nine join conditions (Figure 5 right).

• Finding optimal f-trees for queries of up to eight
relations and six join conditions takes under 0.1 seconds
(Figure 5 left). Finding optimal f-plans for queries on
factorised data is about an order of magnitude slower.
In contrast, the greedy optimiser takes under 5 ms
(Figure 9) without any significant loss in the quality
of factorisation (Figure 6).

• For queries on input relations, factorised query
results are two to six orders of magnitude smaller than
their flat equivalents and FDB outperforms RDB by
up to four orders of magnitude (Figure 7). For the
same workload SQLite performed about three times
slower than RDB, and PostgreSQL performed three
times slower than SQLite; both systems have
additional overhead of fully functioning engines. Also, RDB
implements a hand-crafted optimised query plan.

• The above observations hold for both uniform and Zipf
data distributions, with a slightly larger gap in
performance for the latter (Figure 7).

• The evaluation of subsequent queries on input data
representing query results has the same time
performance gap, since the new input is more succinct as
factorised representation than as relation. Figure 8
compares evaluation times for selection queries on (1)
one relation, which can be trivially evaluated by a
single scan of this relation, and on (2) the factorisation
of that relation, which may require restructuring.

• For one-to-many (e.g., key-foreign key) relationships,
the performance gap is smaller, since the result sizes
for one-to-many joins can only depend linearly on the
input size and not quadratically as in the case of
many-to-many joins, and the possible gain brought by
factorisation is less dramatic. For instance, in the TPC-H
benchmark, the joins are predominantly on keys and
therefore the sizes of the join results do not exceed that
of the relation with foreign keys. Factorised query
results are still more succinct than their relational
representations, but only by a factor that is approximately
the number of relations in the query (experiments
not plotted due to lack of space).

Figure 6: Experiment 2: Comparison of full-search and greedy query optimisers. Panel title: average costs of f-plans and resulting f-trees, computed by full-search and greedy query optimisers; input f-trees have R = 4 relations and A = 10 attributes.

Competing Engines. We implemented FDB and RDB
in C++ for execution in main memory, using the GLPK
package v4.45 for solving linear programs. FDB evaluation
and optimisation are described in previous sections. We
also used the lightweight query engine SQLite 3.6.22 tuned
for main memory operation by turning off the journal mode
and synchronisations and by instructing it to use in-memory
temporary store. Similarly, we ran PostgreSQL 9.1 with the
following parameters: fsync, synchronous commit, and full
page writes are off, no background writer, shared buffers
and working memory increased to 12 GB. Both SQLite and
PostgreSQL read the data in their internal binary format,
whereas FDB and RDB use the plain text format. The
relations are given sorted; this allows RDB to use optimal
relational join plans implemented as multi-way sort-merge
joins. For all engines we report wall-clock times (averaged
over five runs) to read data from disk and execute the query
plans without writing the result to disk.
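
The exact SQLite settings are not listed above. One plausible way to obtain the described tuning, shown here through Python's sqlite3 module, is to issue the three corresponding PRAGMA statements. The mapping of the description to these pragmas and the function name `open_tuned` are assumptions of this sketch.

```python
import sqlite3


def open_tuned(path=":memory:"):
    """Open an SQLite database tuned for main-memory operation.

    Assumed mapping of the textual description to SQLite pragmas:
    no rollback journal, no synchronisation with the disk, and
    temporary tables and indices kept in memory.
    """
    conn = sqlite3.connect(path)
    conn.execute("PRAGMA journal_mode = OFF")
    conn.execute("PRAGMA synchronous = OFF")
    conn.execute("PRAGMA temp_store = MEMORY")
    return conn


conn = open_tuned()
# Reading a pragma back returns its numeric code.
print(conn.execute("PRAGMA synchronous").fetchone()[0])
print(conn.execute("PRAGMA temp_store").fetchone()[0])
```

These pragmas apply per connection, so they have to be issued each time the database is opened.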
Experimental Setup. All experiments were performed
on an Intel(R) Xeon(R) X5650 Quad 2.67GHz/64bit/32GB
running VMWare VM with Linux 2.6.32/gcc4.4.5.
Experimental Design. The flow of our experiments is as 
follows. We generate random data and queries, then repeat 
a number of times four optimisation and evaluation exper- 
iments and report averages of optimisation time, execution 
time, representation sizes, and quality of produced f-plans. 
We generate R relations and distribute uniformly A at- 
tributes over them. Each relation has a given number of 
tuples, each value is a natural number generated from 1 
to M using uniform or Zipf distribution. The queries are 
equi-joins over all of these relations. Their selections are 
conjunctions of K non-redundant equalities. 
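
A generator in the spirit of this design can be sketched as follows. Several details are not specified above and are assumptions of the sketch: attributes are assigned to relations round-robin, the Zipf distribution uses exponent 1, duplicate tuples are discarded so a relation may end up with fewer than the requested number of tuples, and all names are hypothetical.

```python
import random


def generate_workload(r, a, n_tuples, m, k, zipf=False, seed=0):
    """Generate r sorted relations over a attributes and k equalities.

    Values are drawn from 1..m, uniformly or with weights 1/v (Zipf-like).
    The k equalities are non-redundant: none is implied by the others.
    """
    if k > a - 1:
        raise ValueError("at most a - 1 non-redundant equalities exist")
    rng = random.Random(seed)
    attrs = [f"X{i}" for i in range(a)]
    schemas = [attrs[i::r] for i in range(r)]      # round-robin assignment
    values = list(range(1, m + 1))
    weights = [1.0 / v for v in values] if zipf else None

    relations = []
    for schema in schemas:
        rows = {tuple(rng.choices(values, weights=weights, k=len(schema)))
                for _ in range(n_tuples)}
        relations.append(sorted(rows))             # relations are given sorted

    # Union-find over attributes to reject equalities that are already implied.
    leader = {x: x for x in attrs}

    def find(x):
        while leader[x] != x:
            x = leader[x]
        return x

    equalities = []
    while len(equalities) < k:
        x, y = rng.sample(attrs, 2)
        rx, ry = find(x), find(y)
        if rx != ry:
            leader[rx] = ry
            equalities.append((x, y))
    return schemas, relations, equalities
```

The guard on k reflects that A attributes admit at most A − 1 non-trivial equalities.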

For each generated query Q and database D, we do the
following. In Experiment 1, we run the FDB optimiser to
find an optimal f-tree T for the query result and report
the optimisation time and the value of the parameter s(Q)
that controls the size of the f-representation of Q(D) over
T. In Experiment 3, we compute the result Q(D) using
RDB, SQLite, and PostgreSQL, and the factorised query
result using FDB. We then report on both the evaluation
time and size of the result as the number of its singletons; a
singleton holds an 8-byte integer.

In Experiments 2 and 4, we consider new queries on top of 
results produced in Experiments 1 and 3 respectively. The 
new queries are also equi-joins, where the selections are con- 
junctions of L random (not already implied) equalities on 
attribute equivalence classes of T. 

For each new query Q', we run the FDB optimiser to find
an optimal f-plan to compute the result and the resulting
f-tree of the query result. In Experiment 2, we report the
optimisation time and quality of the computed f-plans with
the exhaustive and greedy optimisation algorithms; here, we
consider the cost of the f-plan defined by the parameter s(·)
of the intermediate and final f-trees; in our experiments, the
alternative cost estimate discussed in Section 4.1 would lead
to very similar choices of optimal f-plans. In Experiment 4,
we execute the chosen f-plan with FDB and compute with
the relational engines the selection conditions given by Q'
on a single relation Q(D) computed in Experiment 3. We
report the execution times and query result sizes.

The parameters K and L are subject to K + L < A, as we
can do at most A − 1 non-trivial joins on A attributes. We
run the experiments five times for each parameter setting.
Experiment 1: Query optimisation on flat data. 

Figure 5 shows average times for optimising a query on flat
data, and average costs s(T) for the chosen optimal f-tree T
of the query result. For schemas with A = 40 attributes over
R = 1, ..., 8 relations, we optimised queries of K = 1, ..., 9
equality selections. The cost s(T) of an optimal f-tree T is
always 1 for queries of up to two relations. For R ≥ 3 and a
sufficient number of joins we often get queries with optimal
s(T) = 2 and in very rare cases s(T) > 2. This means
that the sizes of f-representations for the query results are
in most cases at most quadratic in the size of the input database
even in the case of 9 equality selections on 8 relations. The
optimiser searches a potentially exponentially large space of
f-trees to find an optimal one, but runs under 1 second for
queries with fewer than 8 joins on up to 8 relations.
Experiment 2: Query optimisation on factorised data. 

Figure 6 shows the behaviour of query optimisation for
factorised data. It shows the costs of the computed f-plans as
well as the costs of the f-tree of the result computed by the
f-plans for our full-search and greedy optimisation algorithms.

The queries under consideration have L < 6 joins on
f-representations resulting after K < 8 joins on R = 4
relations with A = 10 attributes. The greedy optimiser gives
optimal or nearly optimal results in most cases (by comparison
with the optimal outcome of full search). The exceptions
are queries joining most attributes of an f-representation
produced by a query with few joins (small K, large L). In
all cases the average f-plan cost is between 1 and 2, which
means that the f-plans produce factorisations of at most
quadratic size even though we join 4 relations. The results
also show that for small queries (small L) the cost of the
optimal f-plan is dominated by the cost of the final f-tree.
As the query size (i.e., L) grows, the result f-tree has fewer
attribute classes and its cost is smaller than the cost of the
f-plan (i.e., smaller than the cost of intermediate f-trees that
we must process while evaluating the query).

Figure 7: Experiment 3: Performance of query evaluation on flat relational data. For sizes (top row): FDB
(solid red), RDB and SQLite and PostgreSQL (dashed green). For times (bottom row): FDB (solid red),
RDB (dashed green), and SQLite (dotted blue); PostgreSQL is ca. 3 times slower than SQLite and not shown.

Figure 9 shows the execution times for both optimisers.
The search space of possible f-trees grows exponentially with
the number L of selections and also with the size of the
input f-tree (i.e., with decreasing K). The running time
of the full-search algorithm is proportional to the size of
the search space; we process about 1k f-trees/second. The
greedy heuristic is polynomial in both K and L, and in our
scenario is 2-3 orders of magnitude faster than full search.
Experiment 3: Query evaluation on flat data. 

We compared the performance of FDB, RDB, SQLite, and
PostgreSQL for query evaluation on flat input data.
Figure 7 shows the result sizes and evaluation times for queries
with up to four equality selections on three ternary relations
of increasing sizes in two settings: data generated using a
uniform distribution over the range [1, 100] (left column) and
using a more skewed, Zipf distribution (middle column).



The size gap between factorised and relational results is 
largest for queries with fewer equality selections, since the 
results are larger yet factorisable. The plots support the 
claim that the sizes are bounded by a power law, with a 
smaller exponent for FDB than for the relational engines. 

The rightmost column in Figure 7 considers queries with
four relations, two binary relations of size $8^2 = 64$ and two
ternary relations of size $512 = 8^3$, whose values are drawn
from [1, 20] using uniform and Zipf distributions. This
dataset is combinatorial in nature. Each equality selection in the
query decreases the number of result tuples by a constant
factor of 20, which is exhibited in the flat result size
produced by RDB. FDB factorises the up to 500M data values
into fewer than 4k singletons for all considered queries.

The execution time for all engines is approximately
proportional to their result sizes except for the millisecond
region, where constant overhead dominates. SQLite performs
consistently slightly worse than RDB, and PostgreSQL is
about three times slower than SQLite. We used a timeout
of 100 seconds, which prevented the relational engines from
completing in several cases (no plotted data points).
Experiment 4: Query evaluation on factorised data. 

Figure 8 compares the performance of FDB and RDB for
query evaluation on query results computed in Experiment 3
and with f-plans computed in Experiment 2. The behaviour
of SQLite and PostgreSQL closely follows that of RDB.

FDB evaluates queries consisting of L selections on
factorised representations. The quality of the resulting
factorisation is dictated by the quality of the f-plan. FDB uses
the optimal f-plan found by the full-search optimiser.
Additional experiments (not reported here) reveal that the
f-plans found by the greedy optimiser can be up to 50% slower
than the optimal f-plans. This is a good tradeoff, since the
greedy optimiser runs fast even for large queries, while the
full-search optimiser explores an exponential space.

Figure 8: Experiment 4: Performance of FDB (solid lines) and RDB (dashed lines) for query evaluation
on factorised data. RDB needs one scan over the input relation, whereas FDB may need restructuring the
factorised input. The times and result sizes for SQLite and PostgreSQL are as for RDB and not plotted.

Figure 9: Experiment 2: Performance comparison
of query optimisers. Slower series (top) correspond
to full search, faster series (bottom) to greedy. Panel title: finding an f-plan for a random query with L equalities
on an f-tree with R = 4 relations, A = 10 attributes and K equalities.

RDB just evaluates a selection with a conjunction of L
equality conditions on the attributes of the input relation.
This can be done in one scan over the input relation. For
FDB, the cost of the f-plan may be non-trivial: the more
the f-plan needs to unfold the f-representation, the more
expensive the evaluation becomes. Figure 8 suggests that
FDB only unfolds the f-representations to a small extent.
Similar to query evaluation on flat data, FDB shows up
to 4 orders of magnitude improvement over RDB for both
evaluation time and result size. The gap closes once the size
of the input data decreases to about 1000 tuples and both
FDB and RDB perform in under 0.1 seconds.

Experiments 2 and 4 show that using f-representations for 
data processing is sustainable in the sense that the quality 
of factorisations, in particular their compactness and sizes, 
does not decay with the number of operations on the data. 